75 research outputs found

    GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4

    This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect translation quality errors, specifically for the quality estimation setting without the need for human reference translations. Drawing on the power of large language models (LLMs), GEMBA-MQM employs a fixed three-shot prompting technique, querying the GPT-4 model to mark error quality spans. Compared to previous works, our method has language-agnostic prompts, thus avoiding the need for manual prompt preparation for new languages. While preliminary results indicate that GEMBA-MQM achieves state-of-the-art accuracy for system ranking, we advise caution when using it in academic works to demonstrate improvements over other methods due to its dependence on the proprietary, black-box GPT model.
    Comment: Accepted to WMT 2023.
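The fixed three-shot prompting scheme described above can be sketched as follows. The message wording and the three few-shot examples here are illustrative stand-ins, not the authors' released templates; only the overall shape (a fixed set of shots prepended to every query, regardless of language pair) reflects the abstract.

```python
# Sketch of a GEMBA-MQM-style chat prompt: a fixed set of three few-shot
# examples (made up here for illustration) is prepended to every query,
# independent of the language pair being evaluated.

FEW_SHOTS = [
    {"src_lang": "English", "tgt_lang": "German",
     "source": "The house is big.", "translation": "Das Haus ist klein.",
     "answer": "major: accuracy/mistranslation - 'klein'"},
    {"src_lang": "English", "tgt_lang": "Czech",
     "source": "I like cats.", "translation": "Mam rad kocky a psy.",
     "answer": "major: accuracy/addition - 'a psy'"},
    {"src_lang": "Chinese", "tgt_lang": "English",
     "source": "他昚倩去了孊校。", "translation": "He go to school yesterday.",
     "answer": "minor: fluency/grammar - 'go'"},
]

def user_turn(src_lang, tgt_lang, source, translation):
    """Language-agnostic request: only the language names are interpolated."""
    return (f"{src_lang} source:\n{source}\n"
            f"{tgt_lang} translation:\n{translation}\n"
            "List the MQM error spans with severity (critical/major/minor).")

def build_prompt(src_lang, tgt_lang, source, translation):
    """Return chat messages for a GPT-4 query marking error quality spans."""
    messages = [{"role": "system",
                 "content": "You annotate machine translation errors using the MQM typology."}]
    for shot in FEW_SHOTS:  # the same three shots for every language pair
        messages.append({"role": "user", "content": user_turn(
            shot["src_lang"], shot["tgt_lang"], shot["source"], shot["translation"])})
        messages.append({"role": "assistant", "content": shot["answer"]})
    messages.append({"role": "user",
                     "content": user_turn(src_lang, tgt_lang, source, translation)})
    return messages
```

A scoring harness would send these messages to the GPT-4 API and map the returned spans to a segment score, e.g. with the common MQM weighting of −5 per major and −1 per minor error.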

    Large Language Models Are State-of-the-Art Evaluators of Translation Quality

    We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate nine versions of GPT models, including ChatGPT and GPT-4. We show that our method for translation quality assessment only works with GPT-3.5 and larger models. Compared to results from the WMT22 Metrics shared task, our method achieves state-of-the-art accuracy in both modes when measured against MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.
    Comment: Accepted at EAMT 2023; 10 pages, 8 tables, one figure.
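A zero-shot scoring prompt in the spirit of GEMBA's direct-assessment variant can be sketched as below. The exact wording of the released templates differs; the 0–100 scale, the optional reference, and the single-turn zero-shot shape are what the abstract describes, the rest is an assumption.

```python
def gemba_da_prompt(src_lang, tgt_lang, source, hypothesis, reference=None):
    """Build a zero-shot quality-scoring prompt.

    reference=None gives the quality-estimation (reference-free) mode;
    passing a reference gives the reference-based mode."""
    ref_clause = " with respect to the human reference" if reference is not None else ""
    lines = [
        f"Score the following translation from {src_lang} to {tgt_lang}"
        f"{ref_clause} on a continuous scale from 0 to 100, where 0 means "
        '"no meaning preserved" and 100 means "perfect meaning and grammar".',
        "",
        f'{src_lang} source: "{source}"',
    ]
    if reference is not None:
        lines.append(f'{tgt_lang} human reference: "{reference}"')
    lines.append(f'{tgt_lang} translation: "{hypothesis}"')
    lines.append("Score:")
    return "\n".join(lines)
```

The model's single-number completion is then parsed as the segment score; system-level scores are averages over segments.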

    Hybrid machine translation using binary classification models trained on joint, binarised feature vectors

    We describe the design and implementation of a system combination method for machine translation output. It is based on sentence selection using binary classification models estimated on joint, binarised feature vectors. In contrast to existing system combination methods, which work by dividing candidate translations into n-grams, i.e., sequences of n words or tokens, our framework performs sentence selection and therefore does not alter the selected, best translation. First, we investigate the potential performance gain attainable by optimal sentence selection. To do so, we conduct the largest meta-study on data released by the yearly Workshop on Statistical Machine Translation (WMT). Second, we introduce so-called joint, binarised feature vectors which explicitly model feature value comparison for two systems A, B. We compare different settings for training binary classifiers using single, joint, as well as joint, binarised feature vectors. After having shown the potential of both selection and binarisation as methodological paradigms, we combine these two into a combination framework which applies pairwise comparison of all candidate systems to determine the best translation for each individual sentence. Our experiments confirm that our system outperforms other state-of-the-art system combination approaches. We conclude by summarising the main findings and contributions of our thesis and by giving an outlook on future research directions.
    We describe the design and implementation of a system for combining translations based on non-modifying selection among given candidates. The associated binary classification models are trained using joint, binarised feature vectors. In contrast to other system combination methods, which decompose the given candidate translations into n-grams, i.e., sequences of n words or tokens, our approach works by non-modifying selection of the best translation. First, we investigate the potential of such an approach with respect to the maximum theoretically possible improvement, and conduct the largest meta-study on data published annually by the Workshop on Statistical Machine Translation (WMT). We then define so-called joint, binarised feature vectors, which explicitly model the feature comparison of two systems A, B. We compare different configurations for training binary classification models based on single, joint, as well as joint, binarised feature vectors. Finally, we combine both techniques into a methodology that applies pairwise comparisons of all source systems to determine the best translation. We close with a summary and an outlook on future research topics.
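One plausible reading of the joint, binarised feature vectors and the round-robin pairwise selection is sketched below. The function names, the vector layout (concatenated features plus per-feature comparison bits), and the toy classifier are ours, not the thesis's.

```python
def joint_binarised(feats_a, feats_b):
    """Concatenate both systems' feature values and append binarised
    comparison bits that explicitly encode, per feature, whether A beats B."""
    comparison = [1.0 if a > b else 0.0 for a, b in zip(feats_a, feats_b)]
    return list(feats_a) + list(feats_b) + comparison

def select_best(candidates, classifier):
    """Round-robin pairwise comparison of all candidate systems.

    `candidates` maps a system name to {"features": [...], "translation": str};
    `classifier(vec)` returns True iff the first system of the pair wins.
    The winning sentence is returned unmodified (selection, not recombination)."""
    names = list(candidates)
    wins = {name: 0 for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            vec = joint_binarised(candidates[a]["features"],
                                  candidates[b]["features"])
            winner = a if classifier(vec) else b
            wins[winner] += 1
    best = max(names, key=lambda n: wins[n])  # most pairwise wins
    return candidates[best]["translation"]
```

With a trained classifier in place of the toy comparator, this is run once per sentence, so different systems may win different sentences.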

    Can Machine Learning Algorithms Improve Phrase Selection in Hybrid Machine Translation

    We describe a substitution-based, hybrid machine translation (MT) system that has been extended with a machine learning component controlling its phrase selection. Our approach is based on a rule-based MT (RBMT) system which creates template translations. Based on the generation parse tree of the RBMT system and standard word alignment computation, we identify potential "translation snippets" from one or more translation engines which could be substituted into our translation templates. The substitution process is controlled by a binary classifier trained on feature vectors from the different MT engines. Using a set of manually annotated training data, we observe improvements in terms of BLEU scores over a baseline version of the hybrid system.
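The classifier-gated substitution step can be sketched as follows. The span/snippet representation and the way features are attached to candidates are simplified assumptions; in the paper the spans come from the RBMT generation parse tree and word alignments.

```python
def substitute_snippets(template_tokens, snippets, classifier):
    """Replace aligned spans of an RBMT template translation with snippets
    from other MT engines whenever the binary classifier approves.

    `snippets` maps a (start, end) token span of the template to a list of
    (snippet_tokens, feature_vector) candidates; spans are assumed
    non-overlapping. Rejected spans keep the template's own tokens."""
    output, i = [], 0
    while i < len(template_tokens):
        replaced = False
        for (start, end), candidates in snippets.items():
            if start == i:
                for snippet, features in candidates:
                    if classifier(features):  # classifier predicts an improvement
                        output.extend(snippet)
                        i = end
                        replaced = True
                        break
                break  # at most one span starts here
        if not replaced:
            output.append(template_tokens[i])
            i += 1
    return output
```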

    Results from the ML4HMT-12 shared task on applying machine learning techniques to optimise the division of labour in hybrid machine translation

    We describe the second edition of the ML4HMT shared task, which challenges participants to create hybrid translations from the translation output of several individual MT systems. We provide an overview of the shared task and the data made available to participants before briefly describing the individual systems. We report on the results using automatic evaluation metrics and conclude with a summary of ML4HMT-12 and an outlook on future work.

    Findings of the 2019 Conference on Machine Translation (WMT19)

    This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.

    Iterative Data Augmentation for Neural Machine Translation: a Low Resource Case Study for English–Telugu

    Telugu is the fifteenth most commonly spoken language in the world, with an estimated reach of 75 million people in the Indian subcontinent. At the same time, it is a severely low-resourced language. In this paper, we present work on English–Telugu general-domain machine translation (MT) systems using small amounts of parallel data. The baseline statistical (SMT) and neural MT (NMT) systems do not yield acceptable translation quality, mostly due to limited resources. However, the use of synthetic parallel data (generated using back-translation based on an NMT engine) significantly improves translation quality and allows NMT to outperform SMT. We extend back-translation and propose a new, iterative data augmentation (IDA) method. Filtering of synthetic data and IDA both further boost the translation quality of our final NMT systems, as measured by BLEU scores on all test sets and confirmed by state-of-the-art human evaluation.
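The iterative back-translation loop can be sketched abstractly. Here `train`, `translate`, and `keep` are placeholders for an NMT training run, decoding, and the synthetic-data filter, none of which are specified by the abstract; the loop structure (back-translate, filter, retrain, repeat) is what it describes.

```python
def iterative_data_augmentation(parallel, mono_tgt, rounds, train, translate, keep):
    """Iteratively grow the training data with back-translated pairs.

    parallel  -- list of (src, tgt) sentence pairs
    mono_tgt  -- monolingual target-side sentences
    train     -- callable: pairs -> model
    translate -- callable: (model, sentence) -> translation
    keep      -- filter predicate on a synthetic (src, tgt) pair
    """
    data = list(parallel)
    forward = train(data)
    for _ in range(rounds):
        # Train a target->source model on the flipped data, then
        # back-translate the monolingual target text into synthetic sources.
        backward = train([(t, s) for s, t in data])
        synthetic = [(translate(backward, t), t) for t in mono_tgt]
        # Keep the genuine pairs, add only the synthetic pairs that pass the filter.
        data = list(parallel) + [p for p in synthetic if keep(p)]
        forward = train(data)  # retrain the forward model on augmented data
    return forward, data
```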

    Tumor Heterogeneity in Lymphomas: A Different Breed.

    It has long been recognized that cancers represent tissues consisting of heterogeneous neoplastic, as well as reactive, cell populations, and that cancers of the same histotype may show profound differences in clinical behavior. With the advent of new technologies and the demands of precision medicine, the investigation of tumor heterogeneity has gained much interest. An understanding of intertumoral heterogeneity in patients with the same disease entity is necessary to optimally guide personalized treatment. In addition, increasing evidence indicates that different tumor areas or primary tumors and metastases in an individual patient can show significant intratumoral heterogeneity on different levels. This phenomenon can be driven by genomic instability, epigenetic events, the tumor microenvironment, and stochastic variations in cellular function and antitumoral therapies. These mechanisms may lead to branched subclonal evolution from a common progenitor clone, resulting in spatial variation between different tumor sites, disease progression, and treatment resistance. This review addresses tumor heterogeneity in lymphomas from a pathologist's viewpoint. The relationship between morphologic, immunophenotypic, and genetic heterogeneity is exemplified in different lymphoma entities and reviewed in the context of high-grade transformation and transdifferentiation. In addition, factors driving heterogeneity, as well as clinical and therapeutic implications of lymphoma heterogeneity, will be discussed.

    Towards Automatic Face-to-Face Translation

    In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term as "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact on multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages. Code, models and demo video are made publicly available. Demo video: https://www.youtube.com/watch?v=aHG6Oei8jF0 Code and models: https://github.com/Rudrabha/LipGAN
    Comment: 9 pages (including references), 5 figures. Published in ACM Multimedia, 2019.
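The pipeline above chains four modules: speech recognition, text translation, speech synthesis, and talking-face generation. A minimal composition sketch, with each module passed in as a callable (the dict-based video representation is our assumption, not the paper's interface):

```python
def face_to_face_translate(video, asr, mt, tts, lipgan):
    """Chain the four pipeline stages: transcribe speech in language A,
    translate the text into language B, synthesise target-language speech,
    and render lip-synced face frames with a LipGAN-style generator."""
    audio_a, face_frames = video["audio"], video["frames"]
    text_a = asr(audio_a)        # speech recognition
    text_b = mt(text_a)          # machine translation
    audio_b = tts(text_b)        # speech synthesis
    return {"audio": audio_b, "frames": lipgan(face_frames, audio_b)}
```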

    Machine Translation Human Evaluation: an investigation of evaluation based on Post-Editing and its relation with Direct Assessment

    In this paper we present an analysis of the two most prominent methodologies used for the human evaluation of MT quality, namely evaluation based on Post-Editing (PE) and evaluation based on Direct Assessment (DA). To this purpose, we exploit a publicly available large dataset containing both types of evaluations. We first focus on PE and investigate how sensitive TER-based evaluation is to the type and number of references used. Then, we carry out a comparative analysis of PE and DA to investigate the extent to which the evaluation results obtained by methodologies addressing different human perspectives are similar. This comparison sheds light not only on PE but also on the so-called reference bias related to monolingual DA. Finally, we analyze whether and how the two methodologies can compensate for each other's weaknesses.
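TER-based PE evaluation scores a machine translation by the number of edits needed to turn it into its post-edited version, normalised by length. The sketch below omits TER's block-shift operation, so it is a word-error-rate approximation of HTER rather than full TER.

```python
def hter_approx(hypothesis, post_edited):
    """Word-level edit distance (insert/delete/substitute, no shifts)
    divided by the post-edited reference length -- a simplified TER."""
    hyp, ref = hypothesis.split(), post_edited.split()
    # Standard Levenshtein dynamic programme over tokens.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match / substitute
        prev = curr
    return prev[-1] / len(ref)
```

Real TER additionally counts block shifts as single edits; tools such as sacreBLEU implement the full metric, including its normalization options.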